Code
library(DBI)
library(RSQLite)
library(tidyverse)Tony Duan
This document explains how to connect to and interact with SQL databases from within R. We will cover writing raw SQL queries, using the tidyverse-friendly dbplyr package, and other best practices.

The DBI package provides a universal, consistent interface for connecting to any database. Specific packages like RSQLite, RPostgres, or odbc provide the actual drivers to connect to different database systems.
For these examples, we will use RSQLite, which creates a lightweight, file-based SQL database.
First, we connect to a database file (it will be created if it doesn’t exist). Then, we write the mtcars and iris R data frames into the database as new tables.
# Add car names as a column in mtcars
mtcars_df =cbind(car_name = rownames(mtcars), mtcars)
# Create a connection to an SQLite database file
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)We can list the tables to confirm they were created.
The dbGetQuery() function sends a SQL statement to the database and returns the result as a data frame.
Select all columns from the mtcars table, limiting the result to the first 3 rows.
Use AS to rename a column in the output. You can also create new columns based on calculations.
The WHERE clause filters rows based on a condition.
             car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1          Duster 360 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
2  Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3 Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
4   Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5          Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
6      Ford Pantera L 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4GROUP BY collapses multiple rows into a single summary row. It’s used with aggregate functions like AVG(), COUNT(), MIN(), and MAX().
This query calculates the average horsepower and row count for each cylinder group.
ORDER BY sorts the result set in ascending (ASC) or descending (DESC) order.
This example creates two tables, car_parts and part_info, to demonstrate joins.
Returns only the rows where the join key (part_id) exists in both tables.
Returns all rows from the left table (car_parts) and the matched rows from the right table (part_info).
SQL has functions for manipulating strings and dates. The exact syntax can vary by database. In SQLite, you can use UPPER() and strftime().
UPDATE modifies existing records in a table.
CREATE TABLE builds a new table, and DROP TABLE deletes it.
[1] 0[1] "car_parts"   "iris"        "mtcars"      "mtcars_copy" "part_info"  dbplyrdbplyr is a tidyverse backend for databases. It allows you to write familiar dplyr code, which dbplyr translates into SQL for you. This is the recommended modern approach as it is easier to write and less prone to SQL injection errors.
First, we create a connection to the mtcars table in the database.
Now, we can use dplyr verbs on this object. The code is not run in R; it is translated to SQL and sent to the database.
We can use show_query() to see the SQL that dbplyr generated.
<SQL>
SELECT `cyl`, AVG(`mpg`) AS `avg_mpg`
FROM (
  SELECT `car_name`, `mpg`, `hp`, `cyl`
  FROM `mtcars`
  WHERE (`hp` > 150.0)
) AS `q01`
GROUP BY `cyl`
ORDER BY `avg_mpg` DESCTo execute the query and pull the results back into an R data frame, we use collect().
To prevent SQL injection attacks, you should not use paste() to insert values into a SQL query. Instead, use parameterized queries with ? as a placeholder. DBI will safely insert the values for you.
            car_name  hp
1  Hornet Sportabout 175
2         Duster 360 245
3         Merc 450SE 180
4         Merc 450SL 180
5        Merc 450SLC 180
6 Cadillac Fleetwood 205sqldf on Local Data FramesThe sqldf package allows you to run SQL queries on local R data frames, not on an external database. This can be useful for users who are more comfortable with SQL than with dplyr syntax for data manipulation.
It is very important to close the database connection when you are finished to release resources.
---
title: "Working with SQL Databases in R"
author: "Tony Duan"
execute:
  warning: false
  error: false
format:
  html:
    toc: true
    toc-location: right
    code-fold: show
    code-tools: true
    number-sections: false
    code-block-bg: true
    code-block-border-left: "#31BAE9"
---
This document explains how to connect to and interact with SQL databases from within R. We will cover writing raw SQL queries, using the `tidyverse`-friendly `dbplyr` package, and other best practices.
{width="600"}
# 1. Connecting to a Database
The `DBI` package provides a universal, consistent interface for connecting to any database. Specific packages like `RSQLite`, `RPostgres`, or `odbc` provide the actual drivers to connect to different database systems.
For these examples, we will use `RSQLite`, which creates a lightweight, file-based SQL database.
```{r}
library(DBI)
library(RSQLite)
library(tidyverse)
```
## Create and Populate a Database
First, we connect to a database file (it will be created if it doesn't exist). Then, we write the `mtcars` and `iris` R data frames into the database as new tables.
```{r}
# Add car names as a column in mtcars
mtcars_df =cbind(car_name = rownames(mtcars), mtcars)
# Create a connection to an SQLite database file
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)
```
We can list the tables to confirm they were created.
```{r}
dbListTables(con)
```
# 2. Method 1: Writing Raw SQL Queries
The `dbGetQuery()` function sends a SQL statement to the database and returns the result as a data frame.
## SELECT Statement
Select all columns from the `mtcars` table, limiting the result to the first 3 rows.
```{r}
sql_select <- "SELECT * FROM mtcars LIMIT 3"
dbGetQuery(con, sql_select)
```
## Column Aliasing and Creation
Use `AS` to rename a column in the output. You can also create new columns based on calculations.
```{r}
sql_create_col <- "SELECT mpg, mpg + 1 AS mpg_plus_one FROM mtcars"
head(dbGetQuery(con, sql_create_col))
```
## Filtering with WHERE
The `WHERE` clause filters rows based on a condition.
```{r}
sql_filter <- "SELECT * FROM mtcars WHERE hp > 200 AND cyl = 8"
head(dbGetQuery(con, sql_filter))
```
## Aggregation with GROUP BY
`GROUP BY` collapses multiple rows into a single summary row. It's used with aggregate functions like `AVG()`, `COUNT()`, `MIN()`, and `MAX()`.
This query calculates the average horsepower and row count for each cylinder group.
```{r}
sql_groupby <- "SELECT cyl, AVG(hp) AS avg_hp, COUNT(*) AS num_cars FROM mtcars GROUP BY cyl"
dbGetQuery(con, sql_groupby)
```
## Sorting with ORDER BY
`ORDER BY` sorts the result set in ascending (`ASC`) or descending (`DESC`) order.
```{r}
sql_orderby <- "SELECT car_name, mpg, hp FROM mtcars ORDER BY hp DESC LIMIT 5"
dbGetQuery(con, sql_orderby)
```
## Joins
This example creates two tables, `car_parts` and `part_info`, to demonstrate joins.
```{r}
dbWriteTable(con, "car_parts", data.frame(car_name = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive"), part_id = c(1, 2, 3)), overwrite = TRUE)
dbWriteTable(con, "part_info", data.frame(part_id = c(1, 2, 4), part_name = c("Engine", "Tire", "Brake")), overwrite = TRUE)
```
### INNER JOIN
Returns only the rows where the join key (`part_id`) exists in both tables.
```{r}
sql_inner_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
INNER JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_inner_join)
```
### LEFT JOIN
Returns all rows from the left table (`car_parts`) and the matched rows from the right table (`part_info`).
```{r}
sql_left_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
LEFT JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_left_join)
```
## String and Date Functions
SQL has functions for manipulating strings and dates. The exact syntax can vary by database. In SQLite, you can use `UPPER()` and `strftime()`.
```{r}
sql_string_date <- "SELECT car_name, UPPER(car_name) AS upper_name, '2023-10-27' AS today from car_parts"
head(dbGetQuery(con, sql_string_date))
```
## Modifying Data
### UPDATE Table
`UPDATE` modifies existing records in a table.
```{r}
# Note: dbExecute is used for statements that don't return data
dbExecute(con, "UPDATE mtcars SET hp = 120 WHERE car_name = 'Mazda RX4'")
dbGetQuery(con, "SELECT car_name, hp FROM mtcars WHERE car_name = 'Mazda RX4'")
```
### CREATE and DROP Table
`CREATE TABLE` builds a new table, and `DROP TABLE` deletes it.
```{r}
sql_create <- "CREATE TABLE mtcars_copy AS SELECT * FROM mtcars"
dbExecute(con, sql_create)
dbListTables(con)
```
```{r}
sql_drop <- "DROP TABLE mtcars_copy"
dbExecute(con, sql_drop)
dbListTables(con)
```
# 3. Method 2: Using `dbplyr`
`dbplyr` is a `tidyverse` backend for databases. It allows you to write familiar `dplyr` code, which `dbplyr` translates into SQL for you. This is the recommended modern approach as it is easier to write and less prone to SQL injection errors.
First, we create a connection to the `mtcars` table in the database.
```{r}
mtcars_db <- tbl(con, "mtcars")
```
Now, we can use `dplyr` verbs on this object. The code is not run in R; it is translated to SQL and sent to the database.
```{r}
query <- mtcars_db %>%
  select(car_name, mpg, hp, cyl) %>%
  filter(hp > 150) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(desc(avg_mpg))
```
We can use `show_query()` to see the SQL that `dbplyr` generated.
```{r}
show_query(query)
```
To execute the query and pull the results back into an R data frame, we use `collect()`.
```{r}
results <- collect(query)
results
```
# 4. Best Practice: Parameterized Queries
To prevent SQL injection attacks, you should not use `paste()` to insert values into a SQL query. Instead, use parameterized queries with `?` as a placeholder. `DBI` will safely insert the values for you.
```{r}
hp_threshold <- 150
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"
# dbGetQuery can take a `params` list
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()
```
# 5. Method 3: Using `sqldf` on Local Data Frames
The `sqldf` package allows you to run SQL queries on local R data frames, not on an external database. This can be useful for users who are more comfortable with SQL than with `dplyr` syntax for data manipulation.
```{r}
library(sqldf)
# Note the use of the R data frame `mtcars_df`
sqldf("SELECT car_name, mpg FROM mtcars_df WHERE cyl = 8 LIMIT 5")
```
# 6. Disconnecting from the Database
It is very important to close the database connection when you are finished to release resources.
```{r}
dbDisconnect(con)
```